safety checkers
TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis
Wang, Longtian, Xie, Xiaofei, Li, Tianlin, Zhi, Yuhan, Shen, Chao
Text-to-image (T2I) models have significantly advanced in producing high-quality images. However, such models can generate images containing not-safe-for-work (NSFW) content, such as pornography, violence, political content, and discrimination. To mitigate the risk of generating NSFW content, refusal mechanisms, i.e., safety checkers, have been developed to check for potential NSFW content. Adversarial prompting techniques have been developed to evaluate the robustness of these refusal mechanisms. The key challenge remains to subtly modify the prompt in a way that preserves its sensitive nature while bypassing the refusal mechanisms. In this paper, we introduce TokenProber, a method designed for sensitivity-aware differential testing, aimed at evaluating the robustness of the refusal mechanisms in T2I models by generating adversarial prompts. Our approach is based on the key observation that adversarial prompts often succeed by exploiting discrepancies in how T2I models and safety checkers interpret sensitive content. Thus, we conduct a fine-grained analysis of the impact of specific words within prompts, distinguishing between dirty words that are essential for NSFW content generation and discrepant words that highlight the different sensitivity assessments between T2I models and safety checkers. Through sensitivity-aware mutation, TokenProber generates adversarial prompts, striking a balance between maintaining NSFW content generation and evading detection. Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I models, using 324 NSFW prompts, demonstrates its superior effectiveness in bypassing safety filters compared to existing methods (e.g., a 54%+ increase on average), highlighting TokenProber's ability to uncover robustness issues in existing refusal mechanisms. The source code, datasets, and experimental results are available in [1]. Warning: This paper contains model outputs that are offensive in nature.

Text-to-Image (T2I) models have gained widespread attention due to their excellent capability in synthesizing high-quality images. T2I models, such as Stable Diffusion [2] and DALL·E [3], process the textual descriptions provided by users, namely prompts, and output images that match the descriptions. Such models have been widely used to generate various types of images; for example, Lexica [4] contains more than five million images generated by Stable Diffusion.
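The word-level analysis this abstract describes can be pictured as a leave-one-word-out loop. Below is a minimal sketch in that spirit, not the paper's implementation; `t2i_sensitivity` and `checker_score` are assumed stand-ins for however one scores a prompt from the T2I model's and the safety checker's points of view.

```python
# Hypothetical sketch of fine-grained word impact analysis (not TokenProber's
# actual code). The two scoring callables are assumptions supplied by the user.
from typing import Callable, List, Tuple

def word_impacts(
    prompt: str,
    t2i_sensitivity: Callable[[str], float],  # the T2I model's view of the prompt
    checker_score: Callable[[str], float],    # the safety checker's view
) -> List[Tuple[str, float, float]]:
    """Leave-one-word-out: measure how removing each word shifts the two scores."""
    words = prompt.split()
    base_t2i = t2i_sensitivity(prompt)
    base_chk = checker_score(prompt)
    impacts = []
    for i, w in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        d_t2i = base_t2i - t2i_sensitivity(ablated)  # big drop: word drives NSFW generation
        d_chk = base_chk - checker_score(ablated)    # big drop: word triggers the checker
        impacts.append((w, d_t2i, d_chk))
    return impacts

def split_words(impacts, tau: float = 0.1):
    """'Dirty' words matter to both sides; 'discrepant' words trigger the checker
    without contributing much to generation, making them natural mutation targets."""
    dirty = [w for w, dt, dc in impacts if dt > tau and dc > tau]
    discrepant = [w for w, dt, dc in impacts if dc > tau and dt <= tau]
    return dirty, discrepant
```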
Antelope: Potent and Concealed Jailbreak Attack Strategy
Zhao, Xin, Chen, Xiaojun, Gao, Haoyu
Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily encompass adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics, and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope leverages the confusion of sensitive concepts with similar ones, searches in the semantically adjacent space of these related concepts, and aligns them with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. Besides, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.
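As a rough illustration of the "semantically adjacent concept" idea (not Antelope's actual pipeline), one can rank candidate substitute concepts by their proximity to a sensitive target concept in CLIP text-embedding space; the model ID and the candidate list below are illustrative assumptions.

```python
# Sketch: rank benign-looking candidate concepts by CLIP text-embedding
# similarity to a target concept. Not Antelope's method; model ID is illustrative.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(texts):
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    embeds = text_model(**inputs).text_embeds
    return embeds / embeds.norm(dim=-1, keepdim=True)  # unit-normalize for cosine

def nearest_neighbors(target: str, candidates: list[str], k: int = 5):
    """Return the k candidate concepts closest to the target in embedding space."""
    target_emb = embed([target])
    cand_embs = embed(candidates)
    sims = (cand_embs @ target_emb.T).squeeze(-1)
    top = sims.topk(k=min(k, len(candidates)))
    return [(candidates[i], float(sims[i])) for i in top.indices.tolist()]
```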
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
Ma, Jiachen, Cao, Anda, Xiao, Zhiqing, Zhang, Jie, Ye, Chao, Zhao, Junbo
Text-to-Image (T2I) models have received widespread attention due to their remarkable generation capabilities. However, concerns have been raised about the ethical implications of these models generating Not Safe for Work (NSFW) images, because NSFW images may cause discomfort to people or be used for illegal purposes. To mitigate the generation of such images, T2I models deploy various types of safety checkers. However, these still cannot completely prevent the generation of NSFW images. In this paper, we propose the Jailbreak Prompt Attack (JPA), an automatic attack framework. We aim to find prompts that bypass safety checkers while preserving the semantics of the original images. Specifically, we search for such prompts by exploiting the robustness of the text space. Our evaluation demonstrates that JPA successfully bypasses both online services with closed-box safety checkers and offline safety-checker defenses to generate NSFW images.
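The generic search loop that such attacks share can be sketched as follows. This is a hedged illustration of the evade-while-preserving-semantics objective, not JPA's actual optimization; `mutate`, `passes_checker`, and `similarity` are assumed user-supplied callables (e.g., token substitution, a safety-filter query, and CLIP text similarity).

```python
# Hypothetical search loop: accept a mutated prompt only if it evades the
# safety checker while staying semantically close to the original.
from typing import Callable, Optional

def search_adversarial_prompt(
    prompt: str,
    mutate: Callable[[str], str],
    passes_checker: Callable[[str], bool],
    similarity: Callable[[str, str], float],
    sim_floor: float = 0.85,  # minimum semantic similarity to keep
    budget: int = 200,        # maximum number of mutation attempts
) -> Optional[str]:
    for _ in range(budget):
        candidate = mutate(prompt)
        if similarity(prompt, candidate) < sim_floor:
            continue  # semantics drifted too far from the original prompt
        if passes_checker(candidate):
            return candidate  # evades the filter while staying on-semantics
    return None  # no bypass found within the budget
```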
State-of-the-Art in Nudity Classification: A Comparative Analysis
Akyon, Fatih Cagatay, Temizel, Alptekin
This paper presents a comparative analysis of existing nudity classification techniques for classifying images based on the presence of nudity, with a focus on their application in content moderation. The evaluation focuses on CNN-based models, vision transformers, and popular open-source safety checkers from Stable Diffusion and the Large-scale Artificial Intelligence Open Network (LAION). The Stable Diffusion safety checker is designed to prevent unsafe image generation, while the LAION safety checker works by filtering out unwanted images from the training set to prevent diffusion models from being trained on inappropriate images. These safety checkers demonstrate the growing importance of developing effective and accurate image classification systems.
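For reference, the open-source Stable Diffusion safety checker compared in the paper can be run standalone via diffusers. The sketch below follows common Hugging Face usage; the model IDs are assumptions rather than the paper's exact evaluation setup.

```python
# Minimal standalone use of the Stable Diffusion safety checker on one image.
import numpy as np
from PIL import Image
from transformers import CLIPImageProcessor
from diffusers.pipelines.stable_diffusion.safety_checker import (
    StableDiffusionSafetyChecker,
)

checker = StableDiffusionSafetyChecker.from_pretrained(
    "CompVis/stable-diffusion-safety-checker"
)
extractor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

def is_nsfw(image: Image.Image) -> bool:
    """Return True if the Stable Diffusion safety checker flags the image."""
    clip_input = extractor(images=image, return_tensors="pt").pixel_values
    np_image = np.asarray(image.convert("RGB"), dtype=np.float32)[None] / 255.0
    _, has_nsfw = checker(images=np_image, clip_input=clip_input)
    return bool(has_nsfw[0])
```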